Syntax-Based Statistical Machine Translation: A review
Abstract
Ever since the inception of computers and the first introduction of artificial intelligence, machine translation has been a target goal, or better said, a dream that at some point in the past was deemed impossible (ALPAC 1966). The problem that machine translation aims to solve is simple to state: given a document or sentence in a source language, produce its equivalent in the target language. Solving it is hard because of the inherent ambiguity of language: the same word can have different meanings depending on the context, and idioms and many other computational factors add to the difficulty. Moreover, extra domain knowledge is needed for high-quality output. Early techniques were human-intensive, relying on parsing, transfer rules, and generation with the help of an interlingua (Hutchins 1995). This approach, while performing well in restricted domains, does not scale and is not suitable for languages for which we have no syntactic theory or parser. In the last decade, statistical techniques using the noisy channel model dominated the field and outperformed classical ones (Brown et al. 1993). However, one problem with statistical methods is that they do not employ enough linguistic theory to produce grammatically coherent output (Och et al. 2003). This is because these methods incorporate little or no explicit syntactic theory and capture elements of syntax only implicitly, through the n-gram language model in the noisy channel framework, which cannot model long-distance dependencies. The goal of syntax-based machine translation techniques is to incorporate an explicit representation of syntax into statistical systems to get the best of both worlds: high-quality output without intensive human effort. In this report we give an overview of various approaches to syntax-aware statistical machine translation developed, or proposed, in the last two decades.
In our survey, we stress the tension between the expressivity of a model and the complexity of its associated training and decoding procedures. The rest of this report is organized as follows. Section 2 gives a brief overview of the basic statistical machine translation model that serves as the basis for the subsequent discussion, and motivates the need for deploying syntax in the translation pipeline. Section 3 discusses the various grammar formalisms that have been proposed to model parallel texts. Section 4 describes how these theoretical ideas have been used to augment the basic models of Section 2, details how the resulting models are trained from data, and weighs their complexity against the extra accuracy gained. Finally, we conclude in Section 5.
Similar resources
A new model for persian multi-part words edition based on statistical machine translation
In English, multi-part words are hyphenated, with a hyphen separating the parts. Persian has multi-part words as well. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...
A Dependency Edge-based Transfer Model for Statistical Machine Translation
Previous models in syntax-based statistical machine translation usually resort to some kind of synchronous procedure; few of these works are based on the analysis-transfer-generation methodology. In this paper, we present a statistical implementation of the analysis-transfer-generation methodology in rule-based translation. The procedures of syntax analysis, syntax transfer and language genera...
A New Subtree-Transfer Approach to Syntax-Based Reordering for Statistical Machine Translation
In this paper we address the problem of translating between languages with word order disparity. The idea of augmenting statistical machine translation (SMT) by using a syntax-based reordering step prior to translation, proposed in recent years, has been quite successful in improving translation quality. We present a new technique for extracting syntax-based reordering rules, which are derived ...
Do we need phrases? Challenging the conventional wisdom in Statistical Machine Translation
We begin by exploring theoretical and practical issues with phrasal SMT, several of which are addressed by syntax-based SMT. Next, to address problems not handled by syntax, we propose the concept of a Minimal Translation Unit (MTU) and develop MTU sequence models. Finally we incorporate these models into a syntax-based SMT system and demonstrate that it improves on the state of the art transla...
N-Gram-Based Statistical Machine Translation versus Syntax Augmented Machine Translation: Comparison and System Combination
In this paper we compare and contrast two approaches to Machine Translation (MT): the CMU-UKA Syntax Augmented Machine Translation system (SAMT) and UPC-TALP N-gram-based Statistical Machine Translation (SMT). SAMT is a hierarchical syntax-driven translation system underlain by a phrase-based model and a target part parse tree. In N-gram-based SMT, the translation process is based on bilingual ...
Edinburgh's Syntax-Based Machine Translation Systems
We present the syntax-based string-to-tree statistical machine translation systems built for the WMT 2013 shared translation task. Systems were developed for four language pairs. We report on adapting parameters, targeted reduction of the tuning set, and post-evaluation experiments on rule binarization and preventing dropping of verbs.
Publication date: 2006